Abstract

Large image collections are increasingly significant for media, scientific and other platforms. A case study of Facebook’s predictive modelling of satellite images of human settlement exemplifies how images en masse affect the platform, supporting its expansion and its infrastructuralisation. Under platform conditions, the operation of such models entails shifts in the position of human observers, rearrangements of image collections, reconfiguration of platform architecture and a significant general shift in the referential functioning of images. The treatment of images in a predictive model – a deep neural network – embeds an indexical field, a field that allows the platform not only to generate referential statements about the world but also to position itself in the world. Under platform conditions, image collections function less as archives, records or surveillant gaze, and more as densely woven indexical fields that orient and locate the platform’s operations in everyday experience. In describing the transformation of image collections, the paper points to important changes in how platform media embed themselves in the world.

Keywords: images, platform, machine learning, indexicality, referentiality

Introduction

A fragment of a larger world is compressed into a piece …, impressing its surface with color and light without taking the position of a viewer external to it into account. No scale or human measure is assumed (Alpers 1983, 37)

Facebook’s Connectivity Lab announced in 2016 they had developed a machine learning model – ‘the High Resolution Settlement Layer’ (HRSL) – that identifies individual buildings ‘across the globe’ (Gros 2016; Metz 2016). The model ‘learned’ to see individual buildings from a collection of several billion satellite images.1 Recognizing and locating individual buildings, the Facebook project team counted ‘human artifacts’ down to five meter resolution:

We invoked Facebook’s image-recognition engine — based on a deep convolutional neural network that provides a fixed dimensional feature embedding for all images — and found that, with minor modifications, we could use the engine trained on normal photos to efficiently detect whether a satellite image contained a building. (Gros 2016)

A survey of human settlement is useful to platforms like Facebook: combining building locations with census data, a social network platform can plan the build-out of connectivity for people not yet connected to the internet. Network infrastructure can be placed where it is most likely to contribute to the growth of the platform’s user base. Satellites, drones, microwave wireless links and other Facebook infrastructure, including data centres, can be positioned accordingly. The model supports the ‘infrastructuralisation’ of the platform, as recent sociologists of platforms term it (Plantin et al. 2016), the process whereby platform media insinuate themselves into everyday life as mundane affordances. Not coincidentally, Facebook also aimed to demonstrate the platform’s epistemic capacity to ‘know’ the world better. As Noortje Marres writes, ‘forms of knowledge have become the focal point of controversy in digital societies’ (Marres 2017, 184).

Whether or not Facebook’s model of human settlement, purportedly the most detailed such model ever constructed, garners the platform new users is not so important for the purposes of this paper. The relevance of the example centres on how machine learning ‘embeds’ – a key term in this paper – large image collections in contemporary platforms, whether they are social networks, search engines, scientific knowledge infrastructures, government surveillance systems, autonomous vehicles or smart phones. The image collection directly feeding into HRSL comes from DigitalGlobe’s satellites (DigitalGlobe 2016; Facebook Connectivity Lab and Center for International Earth Science Information Network – CIESIN 2017), which, like a dozen or so other earth imaging systems, photograph the earth at varying degrees of spatial resolution. But HRSL also relies indirectly on social network media image collections. The so-called ‘normal photos’ in Facebook’s announcement reportedly derive from the Instagram photo-sharing platform, a platform where faces, places, and specific things are labelled with hashtags by its many users (Shankland 2018). This mundane photo-sharing image collection with its tags provides a foothold for HRSL’s identification of buildings in the DigitalGlobe satellite images: ‘with minor modifications, we could use the engine trained on normal photos to efficiently detect whether a satellite image contained a building.’

Something quite complicated has happened around image collections. Facebook started life as an online archive of student photographs. The making, circulation and viewing of photographic images are richly differentiated yet ordinary cultural practices on platforms (Van Dijck 2013; Miller and Jolynna 2017). Visual and media studies have begun to account for the effects of platforms on images. Scholars have studied the effect of platforms and networks on image compression techniques (Mackenzie 2008; Cubitt 2014), file formats (Sterne 2012) and camera-enabled devices (Kember 2014; Mackenzie and Munster 2019). The challenges of finding, sorting, classifying, labelling and ranking images have led to far-ranging changes in platform architecture, and occasioned globally distributed data centres heavily tasked with rapid image retrieval.

Platform image architectures now go, however, well beyond their characterisation in terms of a generic ‘database logic’ aiming at retrieval (Manovich 2001).2 Archival retrieval operations have undoubtedly paved the way for the predictive observing and re-weaving of image referentiality. The state of affairs the HRSL model seeks to address is typical. It concerns vast accumulations of images whose value consists in aggregation not in individual experience. We can of course encounter, interpret or evaluate images singly. When we look at the image in Figure 1, we can say with some confidence ‘this is a house’, ‘this is a field’, or ‘this is a road.’ Whatever manipulations that image may undergo, even in presenting it here, our phenomenological investment in the relation between the photographic image and a pre-existing reality remains. As Tom Gunning emphasises, the referential effect of photographic images depends on some mixture of ‘perceptual richness and nearly infinite detail’ (Gunning 2004, 45). The referential significance of images collected under platform conditions is something different.

Visual theorists have long debated how photographic referentiality works: the capacity of a photograph to connect directly to a referent, a person, thing or state of affairs in the world. In semiotic accounts, the referential import of a photograph depends on some mixture of resemblance (or iconic signs, to use the philosopher C.S. Peirce’s term) and indexical signs, signs that lay some path between sign and referent. As Mary Anne Doane suggests, despite the visual primacy of resemblance, indexicality anchors photographic referentiality:

resemblance may occur, but it is not necessary to the functioning of the index as sheer evidence that something has happened, that something exists or existed (Doane 2007, 135).

While resemblance depends on perceptual similarity between image and referent, the indexical component of a photograph relies on the bundled chain of modifications and local connections we imagine, believe or know run back through image formats, metadata (latitude, longitude), databases, satellites and their cameras pointing towards some patch of earth.3 Phenomenological accounts of photographs tend to downplay the very idea of a photographic sign in favour of what Gunning refers to as ‘putting us into the presence of something’ (Gunning 2004, 48).

The referentiality of image collections, however, differs from that of single images, whether understood semiotically or phenomenologically. Individual images are still observable in the collections (see Figure 1), but the collection is not. HRSL operates in a platform-infrastructural zone where images are aggregated directly in databases rather than screened or displayed. The functional statements it generates concerning the existence and position of buildings in particular places are referential, but no longer in the mode of either semiosis or experiential ‘sheer evidence’ (Gunning 2004, 45).

In order to explore what platform image collections might be or do, I will follow the subsidence of images or their withdrawal from perceptual observation through what I term ‘embedding.’ The term, as we will see, has various spatial, visual, mathematical and technical senses, all concerned with containing or setting something within a surrounding mass. In their account, the Facebook Connectivity Lab engineers speak rather technically of a ‘fixed dimensional feature embedding for all images.’ They refer here to what mathematicians call an ‘embedding.’ An embedding ‘incorporate[s] (a mathematical structure) in a larger structure while preserving all important structural features, esp. by the use of a function which is an embedding’ (Dictionary 2018). As they construct and run the HRSL model, the Facebook engineers rely on another embedding, this time of devices within the platform. Electronics engineers speak of a device operating within a broader system or platform design as an embedded system, and HRSL relies on an ‘image recognition engine’ implemented using specialized GPU (Graphics Processing Unit) hardware located in Facebook’s data centres. Moving slightly back, but not far, from the mathematical and hardware senses of embedding, we might also consider human observers, the subjects who experience image collections, as embedded. In his account of transformations of seeing in the nineteenth century, a time when vision was undergoing mechanisation, Jonathan Crary writes of the observer ‘embedded in a system of conventions and limitations’ (Crary 1992, 6). I will suggest that human observers are very much embedded in the likes of Facebook’s HRSL model, albeit compressed in ways that make it seem that, as Svetlana Alpers puts it, ‘no scale or human measure is assumed’ (Alpers 1983, 37).

The different senses of embedding coalesce in the HRSL case and in the main argument I propose in this paper: platforms amass images that, re-processed in aggregate in deep learning models, become the fabric of a transformed referentiality. This aggregate referentiality supports the embedding of platforms into everyday states of affairs, whether in the planning of communication infrastructure as in Facebook’s HRSL, or in the face, object or scene detection systems widely used on adjacent social media platforms such as Instagram, Pinterest, Airbnb, TikTok or Snapchat to individually target content or products. The embedding exemplified in HRSL locates image collections at the heart of the economic, social and cultural changes detailed in recent platform literature (Sriraman, Bragg, and Kulkarni 2017; Plantin and Punathambekar 2018; Van Dijck, Poell, and de Waal 2018).

Although I track the DigitalGlobe satellite images through Facebook and Columbia University’s HRSL model, large image collections are increasingly mobilised on platforms. The models that allow an autonomous car to drive in traffic, a robot to sort cucumbers by size and shape, a photo-sharing app to recognize and label individual faces by name, the DeepMind platform to outplay the world Go champion, or a moth to be identified on iNaturalist all rely on large image collections. The collections are platform endemic, and indeed, almost by definition, they don’t exist elsewhere. The economic, technical, social and cultural significance of the collections is hard to evaluate. Certainly the processing of image collections anchors social media expansion. The collections pervade contemporary science. Their significance extends well beyond social media, entering into cultural institutions, government, transport, logistics, manufacture and health. All of these settings amass images, often within specifically defined limits. Facebook’s observational practices in relation to the earth image collections, however, depend on relatively generic embeddings – abstract structures, devices, and modes of observation – in order to ground or anchor the platform in the world. Whether or not HRSL succeeds operationally is less important for my purposes than tracking how these embeddings anchor platforms.

Observing embedding in the post-archival platform

The observation of image collections is a crucial problem. Facebook’s HRSL model processes an image collection extracted from DigitalGlobe’s earth-imaging satellite–database platform (DigitalGlobe 2016). Individual images are relatively high-resolution. A DigitalGlobe image file, comprising an array of 2000 x 2000 pixels, exceeds one gigabyte. Given earth’s surface area of 510 trillion square metres, each of the 14.1 billion images in the DigitalGlobe image collection covers roughly 200 x 200 metres, or 10 pixels/metre. Such images are not viewed or circulated in the way that smartphone photographs might be on Instagram. A collection of DigitalGlobe images covering any significant geographical area requires petabyte storage and retrieval infrastructures at a platform scale.
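
The coverage figures above can be sanity-checked with a few lines of arithmetic. This is an illustrative sketch only: the constants simply restate the round numbers quoted in the text, and the tile count assumes a non-overlapping tiling.

```python
# Rough arithmetic check of the DigitalGlobe figures quoted above.
# Constants restate the text's round numbers; they are not measurements.

EARTH_SURFACE_M2 = 510e12   # ~510 trillion square metres
TILE_SIDE_M = 200           # each image covers roughly 200 x 200 metres
TILE_SIDE_PX = 2000         # each image is a 2000 x 2000 pixel array

pixels_per_metre = TILE_SIDE_PX / TILE_SIDE_M
tiles_to_cover_earth = EARTH_SURFACE_M2 / TILE_SIDE_M**2

print(f"{pixels_per_metre:.0f} pixels/metre")   # 10 pixels/metre
print(f"{tiles_to_cover_earth:.2e} tiles")      # ~1.3e10 non-overlapping tiles
```

A non-overlapping tiling needs around 12.7 billion images, the same order of magnitude as the 14.1 billion cited; the difference presumably reflects overlap and repeat imaging.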

All of this implies, fairly directly, that image collections are for the most part not seen, displayed, or screened as individual images. Would it be right to say that they are instead ‘just data’? This would be a problematic take on what happens in HRSL and other such models because it limits conceptualisation of how human observers function in the model. The fact that individual images are not viewed by people does not mean that people are not observers of the image collections modelled in HRSL. Jacob Gaboury has recently drawn attention to facets of digital images subtracted from human vision:

Reading the digital image in this way – as an object structured by a set of distinct material practices – allows us to move beyond discourses of simulation and the virtual to a theory of the digital image that is not visible in the rendered output of the screen, but which nonetheless structures and limits our engagement with computational technology (Gaboury 2015, 41–42).

Although single DigitalGlobe images can be rendered on screen (see Figure 1), the observing done in HRSL and many other such models concerns the collection. The typically massive scale of contemporary image collections leverages rather than obscures, in this mode of observation, platform capacity to see.

Any change in seeing done by platforms is accompanied by changes in what people do. A useful lead on how we might approach a re-distribution of seeing comes from historical work on transformations in practices of looking at images. In his account of proto-photographic practices in Techniques of the Observer, Jonathan Crary argues that transpositions or crossovers between human observers and observing devices-instruments define nineteenth-century visuality (Crary 1992). Despite the radical alterations in observational techniques that intervene between then and now, the process of re-distributing observation continues. Reassigning the role of observer prepares the way for an untrammelled consumption of images in which we still bask:

autonomization of sight, occurring in many different domains, was a historical condition for the rebuilding of an observer fitted for the tasks of “spectacular” consumption (Crary 1992, 19)

This is a useful lead. Crary points to prior partial de-couplings of observation from human eyes, a de-coupling that today leads to image recognition engines processing huge numbers of images, images whose classic semiotic or phenomenological referentiality activates only briefly, if at all. The idea that seeing has been re-constructed before in mechanisms and devices alerts us to the need to map emerging configurations that couple people and things like ‘image recognition engines.’ The summary point is that the degree and type of autonomization of vision always depends on the configuration and positioning of observers.

It may be, however, that the re-configured observers of platform image collections occupy embedded or niche positions, thereby making it hard to discern the observing they do. If the predictive embeddings of platform image collections re-configure observation, we would need to ask: (1) how does the re-composition of image collections in HRSL-style predictions position the observers who endow images with referentiality, who anchor the way that images relate back to the world they came from (whether that be a world defined as a particular social field of friends and celebrities, or a world defined as the patterns of human settlement in built environments)? (2) How do vectors of predictive referentiality shift or re-align platforms in the world?

To sum up so far, I have suggested that the processing of large image collections on platforms is not for the most part directed at signifying or visual experience. Instead, in the increasingly common predictive models, the image recognition engines, and the deep learning models to which we now turn, seeing is re-distributed in a differential observational assemblage. The models themselves pre-date social network media or platform society more generally (Van Dijck, Poell, and de Waal 2018), but under platform conditions they have proliferated. The deep neural net model, we will now see, re-weaves image collections through an intricate fabric of indexical relations. The three senses of embedding – as a mathematical structure that compresses something larger into something smaller, as a device surrounded by a larger system, and as the positioning of observers within a system of constraining conventions – point to a way of understanding how this re-weaving changes platforms and also our potential engagement with platforms.

Platform-device embeddings

The Facebook Connectivity Lab HRSL model uses Facebook’s ‘image recognition engine,’ a ‘state of the art convolutional neural network’:

We used state-of-the-art convolutional neural networks to develop a model that has the accuracy to detect individual buildings and simultaneously works on satellite images across the globe. (Tiecke 2016)

Convolutional neural nets are more than two decades old (Lecun et al. 1998). Their recent re-emergence and popularisation attest to the rise of platform-scale image collections. What happens to neural nets in their encounter with platform image collections is instructive. Neural net models such as HRSL do not fit well in the standard networked architectures of information retrieval. Platform architectures are built around data centres, typically comprising tens of thousands of identical commodity computing servers. Such architectures were designed for retrieval and distribution of text and images for display to myriad individual users on the internet.4 Predictive models require a different flow of images. The HRSL model uses specialised hardware built by Facebook called Big Sur. Located in a major data centre in Prineville, Oregon (and reportedly later introduced at other Facebook data centres), Big Sur packs commodity graphics processing units (GPUs) made by Nvidia into a standardised ‘Open Rack’ component (Facebook 2016). The advent of Big Sur (and its successor Big Basin) in the data centres signposts new directions and dimensions of movement for image collections.

It is no accident that Big Sur lines up rows of GPUs (Graphics Processing Units). The GPUs, Nvidia Tesla cards (Nvidia 2016), consist of an on-chip grid of processing devices – cores – specialised in generating computer graphics for gaming and design. Developed in the late 1990s to support real-time computer games such as Quake, where the frame-rates of first-person viewpoint graphics depend on the movement of millions of small polygons (triangles, rectangles, etc.) according to the constraints of perspectival geometry, lighting and shading, GPUs calculate positions and shapes of polygons in movement. The device architecture has one main purpose: to accelerate the rendering of images (frames) by processing many polygons and pixels in parallel.

Big Sur outputs none of the typical GPU’s gaming, CGI cinema or CAD images. Rather than outputting images for display or interactivity, the GPUs input images drawn from the platform around them. This reversal of GPUs, their conversion into observational instruments, began in the early 2000s as scientists adapted them for scientific modelling of complex biochemical structures such as protein folding. Such uses continue to expand, and GPU manufacturers such as NVIDIA now produce a variety of GPUs specialised for data centre operation.5 The bare fact of the embedding of the GPU’s grid of graphic processors in Facebook, Google, Baidu or Microsoft data centres does not tell us how images are transformed there. It points, however, to the fact that something quite different to retrieval of images for display is happening there.

Embedding observers in platforms

We can track the GPU-accelerated trajectories of platform images by considering who or what observes them. Crary defines the observer as one who sees within prescribed possibilities: observers are ‘embedded in a system of conventions and limitations’ (Crary 1992, 6). In the processing of platform image collections in deep neural net models, where do observers stand? Both human and non-human observers shift position in these models. Beginning in the 1980s, neural nets and other pattern recognition devices were crafted by computer vision and machine learning researchers with intimate knowledge of both their image datasets (often scans of handwritten digits, as in the MNIST dataset (LeCun, Cortes, and Burges 2010), or photographs of human faces) and the workings of their models. The modellers’ craft was observational as well as constructive. They developed a sensibility for the tuning of model parameters, for sources of error, and for overfitting, the tendency of a neural net to conform too closely to the features of some images it has observed and consequently mis-recognise the features of images it has not seen.

Despite many announcements of autonomous AI, predictive models still very much take shape under human observation or supervision. Some of the early neural network researchers find themselves working 25 years later on platforms such as Facebook (Yann LeCun) or search engines such as Google (Geoff Hinton). In early neural net models, building models required extensive trials of variations in network architecture and parameters. Researchers compared architectures, and comparing architectures entailed creating many models, whose performance could be measured and observed in various ways. These architectural comparisons had their own visual culture, usually in the form of data graphics that plotted changes in error rates over time and across several competing models.

Then, as now, modelling has its own observational practices. Modellers present some well-defined subset or sample of the images to the model during a training phase, and the parameters of the network are calculated. During the training phase, observation attends to rates of convergence: how quickly do parameters settle on fixed values? A slow convergence suggests the model is not learning well from the image data. Rapid convergence suggests that the model has fixed on some very prominent features in the data and will miss other important features. In a second phase of observation, modellers supply a fresh set of images (the ‘test’ data), and then measure the accuracy of prediction on the new image dataset. The test phase might also compare the model to others whose error rates or rates of convergence have been established elsewhere (‘benchmarks’). Such comparisons might suggest modifications to the model (changes in key parameters or architecture: more layers, different numbers of elements in each layer, different connections between layers).
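
The train/test protocol described above can be illustrated with a deliberately trivial stand-in for a neural net: a one-parameter brightness threshold fitted on one split of labelled data and evaluated on a held-out split. Everything here – the toy data, the threshold rule, the function names – is invented for illustration and has nothing to do with HRSL itself.

```python
import random

random.seed(0)

# Toy labelled 'images': a brightness value and a label (1 = building present).
data = [(random.gauss(0.7, 0.1), 1) for _ in range(200)] + \
       [(random.gauss(0.3, 0.1), 0) for _ in range(200)]
random.shuffle(data)

# Training phase: a well-defined subset of the collection is shown to the model.
train, test = data[:300], data[300:]

# 'Learning' here is simply choosing a threshold from the training sample.
threshold = sum(x for x, _ in train) / len(train)

def accuracy(split):
    """Fraction of labels the threshold 'model' predicts correctly."""
    return sum((x > threshold) == bool(y) for x, y in split) / len(split)

# Test phase: accuracy is measured on images the model has not seen.
print(f"train accuracy: {accuracy(train):.2f}")
print(f"test accuracy:  {accuracy(test):.2f}")
```

The gap (if any) between the two printed accuracies is what observational talk of overfitting refers to.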

We must recognise that even the pre-platform neural net models set up a field of non-human observers, vastly more numerous, alongside the expert humans. The crafting of predictive seeing was already re-distributing observation to devices. In the platform configurations exemplified by HRSL, a convoluted architecture of observation has stabilised and become somewhat routinised. Architectural diagrams of convolutional neural nets such as HRSL convey something of the piling up of non-human observers (see Figure 2). They typically show an image presented at the left-hand side as a flat array of numbers, a succession of cubical layers in parallel in the middle, and one or several planes labelled as ‘output layer’ on the right, where inferences or predictions of what the image shows are generated. Arrows running left to right indicate that images move through successive layers of functional mapping towards the output layer where predictions are produced. The layers of the architecture continue to multiply in recent models.

Figure 2: The AlexNet architecture

Figure 3: 11 x 11 window on red plane of input image

In what sense is the model observing images? Even in the schematic abstraction of a diagram, we can see that the model architecture channels images through many stages of partial observation.6 These stages segment or cut images on many different scales in order to isolate elements that might point to features that support inference. Many observational techniques lay grids over images. Within layers, convolutional neural nets likewise multiply views of the image according to a grid. For instance, the observation of an image in a convolutional neural net begins with a ‘convolutional layer.’ A convolutional layer ‘observes’ incoming images through a sliding window that crops the image into a series of overlapping tiles. The influential AlexNet architecture assumes that the input image has been scaled to dimensions of 224 x 224 pixels. The convolutional window is 11 x 11 pixels, and moves across the image in four-pixel steps to generate 3025 (55 x 55) cropped observations of the image. The cropped 11 x 11 sub-images observed by the first layer (see Figure 3) are themselves observed by successive layers of the network at different scales (5 x 5, 3 x 3, etc.).
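
The sliding-window arithmetic can be written out directly; the helper function below is mine, not part of any library. Note that an 11-pixel window with a four-pixel stride divides a 227-pixel input exactly into 55 positions, whereas the 224-pixel figure usually quoted from the AlexNet paper leaves a three-pixel remainder, a well-known discrepancy in that paper.

```python
def n_windows(size: int, window: int, stride: int) -> int:
    """Number of window positions along one dimension of an image."""
    return (size - window) // stride + 1

# An 11 x 11 window moving in 4-pixel steps over a 227-pixel-wide input
# yields the 55 x 55 = 3025 cropped observations mentioned above.
per_side = n_windows(227, 11, 4)
print(per_side, per_side * per_side)   # 55 3025
```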

Metaphorically, the layered architecture might be imagined as a series of lenses in an apparatus, interspersed with filters, prisms and slits that spread out and narrow the image into a spectrum of sub-image views for many slightly differently positioned observers. The convolutional neural net stages a kaleidoscopic series of partial observations of images arranged so that successive re-observations effect higher levels of abstraction. The architecture of the model responds to the problem of observing many images by arranging many partial observers, embedded in layers. Each takes a view whose significance in the model’s overall mapping of images to predictive statements must be calculated.7

The deep learning models seem to threaten referentiality in any of its resembling, indexical or sheer evidential modes. Not only are there typically millions of images in the collections, but the model kaleidoscopically multiplies observers of images. The millions of unit ‘neurons’ in the net each generate partial observations of the images (for example, layer 1 in AlexNet has 55 x 55 x 48 x 2, or 290,400 units, each with an input window of 11 x 11, or 121 pixels per colour channel). The architecture of the net presents a highly organised flow of observations, connected to each other in varying degrees (see Figure 2, where some layers are labelled as dense because the units are fully connected, and others are only partially connected). Nothing in this network of partial observers, however, seems to align partial observations towards a simple referential statement such as ‘this is a house.’ What overview, summary, accumulation or synthesis could possibly move from so many partial observations to statements of affairs in the world?

Organising observers through convergence

The mathematical notion of embedding – the inclusion of the structural features of a larger group in a smaller group – points to a general organizing synthesis running through the model. The architecture of the model, its layers of partial observers, although numerous, reduces the sheer visual diversity of the image collection to a smaller, albeit numerous, group of observations whose individual contribution to the ongoing relation between image and a reality that precedes it is a matter of intense interest. During the ‘learning’ phase, the operation of embedding the image collection in the model transforms or maps the widely varying elements of the images onto the fewer, albeit still numerous, varying parameters or ‘weights’ in the deep learning model. The computational achievement of that mapping for large collections of images, with their seemingly inexhaustible appetite for things, places, light, points of view, and faces, has kindled heated interest in AI more generally.

What is called ‘learning’ in deep learning, or machine learning more generally, is the derivation of the embeddings, the distillation of observations capable of referentially grounding images in pre-existing realities. During its ‘learning’ or ‘training’ phase, each time the model moves an image between input and output layers, it measures differences between the expected output based on prior observations (usually in the form of tags on photographs) and the current output of the model, usually a probability that can be formulated as a statement. The ‘learning’ done by the model’s algorithm is a gradual adjustment of the parameters of each of the many partial observers or units of the neural net to align actual output with expected output. While there are many technical and mathematical details in play during learning, the repeated observing and adjusting tends to minimise the differences (‘losses’) between reference and inference, between what is already known and what is predicted. The learning process relies on measures of convergence to determine just how much and how long to keep adjusting the parameters of the many partial observers. Convergence is a limit state in which differences between expected and actual outputs have stabilised at a minimum value.
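
A minimal sketch of learning as convergence, assuming nothing about HRSL itself: a single ‘partial observer’ with one adjustable weight, nudged by gradient descent until the loss – the measured difference between expected and actual output – stops changing. The toy data, learning rate and tolerance are invented for illustration.

```python
# One unit, one weight: fit y = w * x to data whose expected outputs are known.
pairs = [(x, 2.0 * x) for x in range(1, 6)]
w, lr, tol = 0.0, 0.01, 1e-9

def loss(weight):
    """Mean squared difference between expected and actual output."""
    return sum((weight * x - y) ** 2 for x, y in pairs) / len(pairs)

prev = loss(w)
for step in range(10_000):
    # Gradient of the mean squared loss with respect to the weight.
    grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
    w -= lr * grad                    # adjust the observer's parameter
    cur = loss(w)
    if abs(prev - cur) < tol:         # convergence: the loss has stabilised
        break
    prev = cur

print(f"w = {w:.3f} after {step + 1} adjustments")   # w approaches 2.0
```

The stopping rule is the point of the sketch: training ends not when the model is ‘right’ in any absolute sense, but when further adjustment no longer changes the measured difference.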

The production of convergence, a state of affairs in which the parameters of all the elements of the model no longer change, and in which predictions align with known observations, comes at a cost. Deep learning demands large image collections with labels or tags carrying referential weight in order to reach convergence. The observational capacity of HRSL depends on the prior convergences embedded in Facebook’s ‘image recognition engine’ and the original labelled images that Facebook at some point used in training. This prior image collection, with known referentiality, grounds the model in the world as perceived by obscure observers, presumably Facebook users, sometime in the past. HRSL and its ilk stand on observations sedimented in previous bouts of learning. Convergence, or finding the optimal alignment of partial observers, in a deep learning model such as HRSL demands platform infrastructure: the computational parallelism of Big Sur, the architecture of the convolutional neural net, the hundreds of thousands of GPU cores, and the myriad images all respond to the problem of producing convergence between the different partial views of the images generated by the neural net.

This is a radical shift in referentiality. Embedding the many variations in the image collection as calculated weights of the model, machine learning derives an aggregate referentiality comprising many indexical relations directed at different visual elements such as lines, curves, colour gradients, and edges. The embedding of image collections in deep learning models bears little resemblance to images observed on screen, even if it is rooted in such observation at some time in the past. A deep learning model such as HRSL records in its calculated parameters patterns that index the world through images. As numerical values, the parameters could measure many different things, but in the context of HRSL they weight the multi-scale observations of the image generated by deep learning architectures such as AlexNet.

We might more broadly view embedded indexicality as an after-image or trace-memory of the collective processing of images through the training phases. When the computationally intensive training (or 'learning') of a model has been completed, any new image presented to the model activates elements of the model according to those traces. As DigitalGlobe satellite images feed in, different elements of the HRSL model activate in response, outputting probability values that locate buildings. To say that 'the model observes new images' means that, activated by them, the model traces in their visual richness the marks of previously seen images.
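The activation of a trained model by a new image can be sketched in miniature. The frozen parameters below stand in for the 'traces' left by training; the feature values stand in for a newly presented satellite image. All numbers are invented for illustration.

```python
import math

# Parameters frozen at the end of training: the 'trace-memory' of the
# image collection. These particular values are invented for illustration.
weights = [0.8, -1.2, 2.0]
bias = -0.5

def predict(features):
    """Activate the frozen model with a new input; return P(building)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# A 'new satellite image', reduced to three illustrative feature values.
p = predict([1.0, 0.2, 0.9])
```

Nothing in the parameters changes at this point; the new image simply activates the sedimented traces, and a probability value flows out.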

Conclusion: the referential grounding

What do we learn from the models designed to work on large image collections? Although my account has not broached the different social interests, the problems of power, the associated processes of commodification or surveillance, or the rising tide of images in everyday lives, the description of embeddings I propose here might inform work on these facets.

We can expect versions of these models to proliferate and become ordinary. Amazon and Google have announced cameras that incorporate deep learning models for faces, animals and things directly in small devices (Barr 2017; Rutledge 2017). Social media, autonomous vehicles, facial recognition, weapons and apps, and the platforms behind them, depend more and more on image collections. Announcements of 'mechanised abstracting procedures' (Daston and Galison 1992, 103) for large image collections have a significance beyond platforms and their business models.

Read literally, HRSL shows how a platform infrastructuralises. The ‘High Resolution Settlement Layer’ models built environments of the world, environments that directly affect where in the world Facebook infrastructures will be located. Large-scale demonstrations of predictive epistemic power – all the buildings on earth, all the cats on YouTube, the names of all the faces on Instagram – economise platforms more generally. The capacity to ground predictions in images extracts surplus value from the referential richness of images. All platform modelling, whether in the service of recommendation, navigation, advertising or filtering, draws on this referential capital, historically accumulated in the visualities of science, popular culture and media industries.

We could note too that HRSL has a promissory, figurative effect. Spectacular performances such as Facebook’s HRSL model engender belief in the future of the platform. Such models feed forward into speculative platform economies in which growth in numbers of users, increasing revenues and especially the platform’s potential for future expansion are pivotal financial concerns. Predictive models supported by image collections promise platform access to the fabric of experience, organisational life, government, science and communication, or to any social field in which images have a referential function.

What about the operational life of such models on platforms? Here the lesson is more complicated. HRSL generates statements concerning the location of buildings on earth in order to position a platform, Facebook, on the earth. The increasingly routine reconfiguration and re-mapping of image collections using deep learning models configures platforms in ways that media and technology scholars are just beginning to map. This multi-faceted positioning and supporting of platforms through image-derived referentiality motivates my efforts to conceptualise them in terms of embeddings. The senses of embedding explored in this article range from the recent installation of image processing devices such as Big Sur GPUs in platform data centres, through the re-distribution of human and non-human observers within model architectures, to the embedding of certain cross-cutting features of image collections in model parameters. These different senses of embedding interlock, combining architectural and observational practices, to animate the sometimes startling predictive agency of platforms.

Given this layering and interweaving of humans, devices, abstractions and infrastructures, what happens to image collections in HRSL, and on platforms more generally, is perhaps not well understood in terms of seeing, or in terms of the semiotic or phenomenological apprehension of images. We imagine perhaps that platforms are a portal through which we see things, and many platforms do intensely re-socialise seeing. But these models, and again HRSL is just one of many platform-embedded device-observer-convergence practices, re-define what images do. The interlocking embeddings peculiar to platforms imbue collected images with an aggregate referentiality. Not only is this treatment of image collections peculiar to platforms, it operates, perhaps more importantly, to position platforms in the world. The large-scale movement of images, their dis-assembly and re-tracing as features inside data centre GPU-based models, and the many partial observations by algorithms focused on convergence are not seeing as we know it. They are better understood as ways of embedding platforms in the world. The embeddings I have been describing – devices and infrastructures, architectures for human and non-human observers, mathematical derivations of compressed structure – connect platforms to places, faces, feelings and propensities in the world.

Under platform conditions, then, images overflow their viewing on screens, on walls or pages and become threads in a densely woven indexical fabric. Indexical expansion, part platform-architecture, part trace-structure of an image collection, shapes new vectors of action and movement, which we see in suggestions, automatically tagged images, faces labelled with names, and a myriad of other predictive forms and affordances. In their model embeddings, image collections, alongside media-formatted speech, text, and other data forms, open pathways for action indexed to everyday states of affairs in the world.

References

Alpers, Svetlana. 1983. The Art of Describing: Dutch Art in the Seventeenth Century. Chicago: University of Chicago Press.

Barr, Jeff. 2017. “AWS DeepLens – Get Hands-on Experience with Deep Learning with Our New Video Camera.” Amazon Web Services. November 29, 2017. https://aws.amazon.com/blogs/aws/deeplens/.

Crary, Jonathan. 1992. Techniques of the Observer: On Vision and Modernity in the Nineteenth Century. Cambridge, MA: MIT Press.

Cubitt, Sean. 2014. The Practice of Light: A Genealogy of Visual Technologies from Prints to Pixels. Cambridge, MA: MIT Press.

Daston, Lorraine, and Peter Galison. 1992. “The Image of Objectivity.” Representations, no. 40: 81–128. http://www.jstor.org/stable/2928741.

Oxford English Dictionary. 2018. “Embed | Imbed, V.” OED Online. Oxford University Press. http://www.oed.com.ezproxy.lancs.ac.uk/view/Entry/60835.

DigitalGlobe. 2016. “See a Better World with High-Resolution Satellite Imagery.” Digital Globe. 2016. https://www.digitalglobe.com/.

Doane, Mary Ann. 2007. “The Indexical and the Concept of Medium Specificity.” Differences 18 (1): 128–52.

Facebook. 2016. “Facebook to Open-Source AI Hardware Design.” Facebook Code. 2016. https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/.

Gaboury, Jacob. 2015. “Hidden Surface Problems: On the Digital Image as Material Object.” Journal of Visual Culture 14 (1): 40–60.

Gros, Andreas. 2016. “Connecting the World with Better Maps.” Facebook Code. February 22, 2016. https://code.facebook.com/posts/1676452492623525/connecting-the-world-with-better-maps/.

Gunning, Tom. 2004. “What’s the Point of an Index? Or, Faking Photographs.” Nordicom Review 25 (1-2): 39–49.

Halpern, Orit, Joshua Malitsky, and Oliver Gaycken. 2012. “Perceptual Machines: Communication, Archiving, and Vision in Post-War American Design.” Journal of Visual Culture 11 (3): 328–51. https://doi.org/10.1177/1470412912455619.

Kember, Sarah. 2014. “Face Recognition and the Emergence of Smart Photography.” Journal of Visual Culture 13 (2): 182–99. https://doi.org/10.1177/1470412914541767.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “Imagenet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.

Facebook Connectivity Lab and Center for International Earth Science Information Network (CIESIN). 2017. “High Resolution Settlement Layer (HRSL).” https://ciesin.columbia.edu/data/hrsl/.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44.

LeCun, Yann, Corinna Cortes, and Christopher J. C. Burges. 2010. “MNIST Handwritten Digit Database.” AT&T Labs [Online]. http://yann.lecun.com/exdb/mnist.

LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11): 2278–2324.

Mackenzie, Adrian. 2008. “Internationalization.” In Software Studies, 153–60. Cambridge, MA: MIT Press.

Mackenzie, Adrian, and Anna Munster. 2019. “Platform Seeing: Image Ensembles and Their Invisualities.” Theory, Culture & Society, June, 0263276419847508. https://journals.sagepub.com/doi/abs/10.1177/0263276419847508.

Manovich, Lev. 2001. The Language of New Media. Cambridge, MA: MIT Press.

Marres, Noortje. 2017. Digital Sociology: The Reinvention of Social Research. 1 edition. Malden, MA: Polity.

Metz, Cade. 2016. “AI Helps Facebook’s Internet Drones Find Where the People Are.” WIRED. February 22, 2016. https://www.wired.com/2016/02/facebook-ai-shows-internet-drones-where-all-the-people-are/.

Microsoft. (2018) 2018. USBuildingFootprints: Computer Generated Building Footprints for the United States. Microsoft. https://github.com/Microsoft/USBuildingFootprints.

Miller, Daniel, and Jolynna Sinanan. 2017. Visualising Facebook. London: UCL Press.

Nvidia. 2016. “NVIDIA Tesla V100.” NVIDIA. 2016. https://www.nvidia.co.uk/data-center/tesla-v100/.

Plantin, Jean-Christophe, Carl Lagoze, Paul N Edwards, and Christian Sandvig. 2016. “Infrastructure Studies Meet Platform Studies in the Age of Google and Facebook.” New Media & Society, August, 1461444816661553. https://doi.org/10.1177/1461444816661553.

Plantin, Jean-Christophe, and Aswin Punathambekar. 2018. “Digital Media Infrastructures: Pipes, Platforms, and Politics.” Media, Culture & Society, December, 0163443718818376. https://doi.org/10.1177/0163443718818376.

Rutledge, Billy. 2017. “Introducing AIY Vision Kit: Make Devices That See.” The Keyword. November 30, 2017. https://blog.google/topics/machine-learning/introducing-aiy-vision-kit-make-devices-see/.

Shankland, Stephen. 2018. “Your Instagram Hashtags Helped Facebook’s AI Get Smarter at Photo Recognition - CNET.” CNET. May 2, 2018. https://www.cnet.com/news/instagram-hashtags-help-smarter-photo-recognition/.

Sriraman, Anand, Jonathan Bragg, and Anand Kulkarni. 2017. “Worker-Owned Cooperative Models for Training Artificial Intelligence.” In Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 311–14. CSCW ’17 Companion. New York, NY, USA: ACM. http://doi.acm.org/10.1145/3022198.3026356.

Sterne, Jonathan. 2012. MP3: The Meaning of a Format. Durham, N.C.: Duke University Press.

Tiecke, Tobias. 2016. “Open Population Datasets and Open Challenges.” Facebook Code. November 26, 2016. https://code.facebook.com/posts/596471193873876/open-population-datasets-and-open-challenges/.

Van Dijck, José. 2013. “Facebook and the Engineering of Connectivity: A Multi-Layered Approach to Social Media Platforms.” Convergence: The International Journal of Research into New Media Technologies 19 (2): 141–55. http://con.sagepub.com/content/early/2012/09/17/1354856512457548.abstract.

Van Dijck, José, Thomas Poell, and Martijn de Waal. 2018. The Platform Society: Public Values in a Connective World. Oxford University Press.

Zuckerberg, Mark. 2016. “Building Jarvis.” December 20, 2016. https://www.facebook.com/notes/mark-zuckerberg/building-jarvis/10154361492931634/.


  1. Soon afterwards, Microsoft made a similar announcement: it was releasing the ‘footprints’ of 100 million buildings in the USA as a large image collection generated by a deep neural net (Microsoft [2018] 2018). The years 2016-2019 were replete with platform-based image-labelling projects.

  2. It would be inadequate to understand the Facebook HRSL Model as an example of the mechanisation, automation or even ‘algorithmicisation’ of vision. Although mechanisms, forms of automation and algorithms certainly play a role in the contemporary life of large image collections, concepts such as mechanisation, automation and algorithm remain too rigid to grasp the multiple senses of embedding (platform, device, abstraction, containment, surrounding mass) and the re-configuration of referentiality taking shape in the midst of large image collections.

  3. The data format used in raster images such as DigitalGlobe’s satellite photos is TIFF (Tagged Image File Format). The TIFF graphic data format comprises, at its core, a grid or matrix of numerical values that describe the colour of each picture element or pixel. A typical DigitalGlobe image has 1900 x 1900 pixels, or approximately 3.6 million numbers.
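  A minimal sketch of this grid-of-numbers structure, with the dimensions mentioned in the note and arbitrary pixel values standing in for real colour data:

```python
# Dimensions of a typical DigitalGlobe tile, per the note above.
width = height = 1900

# A raster image is, at its core, a grid of numbers: one value per pixel
# (per colour band). Every pixel here is given an arbitrary value of 0.
image = [[0] * width for _ in range(height)]

pixel_count = sum(len(row) for row in image)   # 1900 * 1900 = 3,610,000
```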

  4. Data centre architectures are somewhat standardised. A set of images posted by Mark Zuckerberg, Facebook’s CEO, on his pages in 2016 shows a new data centre in Luleå, in the far north of Sweden (Zuckerberg 2016). The images display a grid of computers or servers. A long hall is filled with rows of identical server racks. Each rack has a column stacked with servers or generic computers.

  5. The proliferation of GPUs in science and now data centres has some analogies with the spread of graphic images in scientific atlases in 19th century. In their account of ‘mechanical objectivity’, Lorraine Daston and Peter Galison stress the transverse movement of these graphics: ‘graphical representation could cut across the artificial boundaries of natural languages to reveal nature to all people, and graphical representation could cut across disciplinary boundaries to capture phenomena as diverse as the pulse of a heart and the downturn of an economy. Pictures became more than merely helpful tools; they were the words of nature itself’ (Daston and Galison 1992, 116).

  6. The architecture of the convolutional neural net was loosely motivated at the time of its inception in the early 1990s by the neurobiology of vision, or at least, a set of findings concerning primate vision published in the 1950s (LeCun et al. 1998). Relatively old biological knowledge, not at all the state-of-the-art, guides the construction of observational devices such as neural nets. Neural networks as computational models stem more generally from the post-WWII cybernetic sciences (Halpern, Malitsky, and Gaycken 2012). The multi-layer perceptron, as neural nets are still sometimes termed, dates back to the work of Frank Rosenblatt at the Cornell Aeronautics Laboratory, USA, in the 1950s. Neurobiological motivations or intuitions stand at a good distance from the operations supported by the architecture of the convolutional neural net. As Yann LeCun, Director of AI at Facebook, and one of the early developers of convolutional neural nets, comments: “I don’t like people saying this because, while Deep Learning gets an inspiration from biology, it’s very, very far from what the brain actually does” (LeCun, Bengio, and Hinton 2015).

  7. The output layer, shown on the left of Figure 3, contains 1000 units. In this architecture, these units correspond to the 1000 categories of images in the ImageNet dataset (Krizhevsky, Sutskever, and Hinton 2012). In the HRSL model, the output layer contains only two units: True and False for the presence of a building.
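  The contrast between the two output layers can be sketched with a softmax function, which converts a layer's raw outputs into probabilities summing to one. All logit values below are invented for illustration.

```python
import math

def softmax(logits):
    """Convert a layer's raw outputs into probabilities summing to 1."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# ImageNet-style output layer: 1000 units, one per category (logits invented).
imagenet_probs = softmax([0.1] * 999 + [5.0])

# HRSL-style output layer: two units, building vs. no building (logits invented).
building_probs = softmax([2.3, -1.1])   # [P(True), P(False)]
```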